Bypass the systemd service restart limit and do immediately restart when change to local mode #15432
Your code changes cover more than what the PR title and description say. Could you update them? In reply to: 1602113471
Thanks for your comments. I have updated them; could you please go ahead and review?
For the first PR, I think it should be a separate PR. I spent quite some time figuring out which line of code maps to this reset-failed. Even for this feature, we need to explain the statement "It's easy to meet this limit when upgrade and fallback happen at the same time." Why? I couldn't figure it out.
Maybe the word "easy" caused the confusion. It is actually not that easy; it is an extreme case. Let me first explain how the systemd service restarts when k8s upgrades the container. For example, for v1 (kube) --> v2 (kube), k8s stops the v1 container first. At that point the systemd service is running "docker wait v1-id". Once the v1 container stops, "docker wait v1-id" returns an error code and the systemd service exits with that error code. Because of the restart policy, the systemd service restarts and the failure count increases by 1. But the failure limit is only 3 within 20 minutes, which means that if we upgrade or fall back three times within 20 minutes, the systemd service will never come back up. As a possible example, if a fallback happens during an upgrade, the failure count will be 2, and one more failure will bring the systemd service down. So we need to run reset-failed before we restart the systemd service.
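To make the counting concrete, here is a toy model of the limit described above, using the numbers from this comment (3 within 20 minutes). `StartLimiter` and its methods are hypothetical names for illustration only, and the sliding-window logic is a simplification, so the exact off-by-one behaviour of systemd may differ. The `reset_failed` method stands in for `systemctl reset-failed`, which resets a unit's start rate limit counter.

```python
from collections import deque


class StartLimiter:
    """Toy model: at most `burst` starts are allowed per `interval_sec`."""

    def __init__(self, burst=3, interval_sec=20 * 60):
        self.burst = burst
        self.interval_sec = interval_sec
        self._starts = deque()  # timestamps (seconds) of accepted starts

    def try_start(self, now):
        """Record a start attempt at time `now`; return False once the limit is hit."""
        # Forget attempts that have aged out of the window.
        while self._starts and now - self._starts[0] >= self.interval_sec:
            self._starts.popleft()
        if len(self._starts) >= self.burst:
            return False  # the unit would stay down
        self._starts.append(now)
        return True

    def reset_failed(self):
        """Clear the counter, standing in for `systemctl reset-failed <unit>`."""
        self._starts.clear()
```

With the defaults, three restarts inside 20 minutes are accepted and the next one is refused until either the window passes or `reset_failed()` is called first, which is the situation the upgrade plus fallback case can run into.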
In this case it is a planned stop, so why does docker wait return an error code? Can we treat it as a planned stop with exit code 0?
The reason is that k8s kills the container, so the docker wait result is not zero. One option is to check whether the wait ID is the feature name. If it is the feature name, it's a local container; if not, it's a kube container. For a kube container, after docker wait returns we could ignore its return code and return 0 directly. This may be a solution. I can try implementing it to verify that it's feasible.
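The idea in this comment could look roughly like the sketch below. It is only an illustration of the proposal, not code from this PR: `service_exit_code` and its parameters are hypothetical names, and it assumes the caller already knows which ID was passed to docker wait and what code it returned.

```python
def service_exit_code(waited_id, feature_name, wait_rc):
    """Map a `docker wait` result to the exit code the service should report.

    waited_id    -- the ID or name that was passed to `docker wait` (assumed known)
    feature_name -- the feature's own name, used for the local container
    wait_rc      -- the code that `docker wait` reported
    """
    if waited_id == feature_name:
        # Local container: keep the real code so genuine failures still count.
        return wait_rc
    # Kube-managed container: k8s killed it on purpose, so report a clean exit.
    return 0
```

For example, `service_exit_code("snmp", "snmp", 137)` keeps the failure, while a kube container ID with the same code maps to 0 and would not add to the failure count.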
…/sonic-buildimage into bypass-systemd-restart-limit
…start when change to local mode (sonic-net#15432)
Why I did it
During an upgrade via k8s, the feature's systemd service restarts as well. Every feature systemd service has a restart limit, and the limit is small: only three times. If a fallback happens during an upgrade, the start count will be 2, and one more restart will bring the systemd service down, so this limit needs to be bypassed. The restart function is called for local -> kube, kube -> kube, and kube -> local. Each call needs the restart to succeed, so we do reset-failed every time we restart. When going back to local mode, we restart the systemd service immediately instead of waiting for the default restart interval, which reduces container downtime.
Work item tracking
Microsoft ADO (number only): 24172368
How I did it
Before every restart for an upgrade, reset the feature's restart count. The count is reset to 0 to bypass the restart limit. When going back to local mode, restart the systemd service immediately.
How to verify it
The feature's systemd service can always be restarted successfully during the upgrade process via k8s.
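A minimal sketch of the "reset, then restart" sequence described in this commit message follows. It is not the code from the PR: `restart_feature`, the `run` parameter, and the assumption that the unit is named `<feature>.service` are all illustrative. The two `systemctl` subcommands, `reset-failed` and `restart`, are real.

```python
import subprocess


def restart_feature(feature, run=subprocess.run):
    """Reset the unit's failure counters, then restart it right away.

    `run` is injectable so the sequence can be exercised without systemd.
    Assumes (hypothetically) that the unit is named "<feature>.service".
    """
    unit = f"{feature}.service"
    commands = [
        # Clear the failed state and the start rate limit counter first.
        ["systemctl", "reset-failed", unit],
        # Then restart immediately instead of waiting for the auto-restart.
        ["systemctl", "restart", unit],
    ]
    for cmd in commands:
        run(cmd, check=False)
    return commands
```

Passing a stub for `run` makes it easy to check the order of the two commands in a unit test without touching a real service.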
…start when change to local mode (#15432) (#15839)
Cherry-pick PR to 202305: #15868
Cherry-pick PR to 202211: #15869
Why I did it
Work item tracking
24172368
How I did it
How to verify it
The feature's systemd service can always be restarted successfully during the upgrade process via k8s.